Gas adsorption meets deep learning: voxelizing the potential energy surface of metal-organic frameworks

Intrinsic properties of metal-organic frameworks (MOFs), such as their ultraporosity and high surface area, render them promising solutions for problems involving gas adsorption. Nevertheless, due to their combinatorial nature, an enormous number of structures is feasible, which makes the selection of the best candidates with traditional techniques cumbersome. Recently, machine learning approaches have emerged as efficient tools to deal with this challenge, allowing researchers to rapidly screen large databases of MOFs via predictive models. The performance of the latter is tightly tied to the mathematical representation of a material, thus necessitating the use of informative descriptors. In this work, a generalized framework to predict gaseous adsorption properties is presented, using as its one and only descriptor the capstone of chemical information: the potential energy surface (PES). In order to be machine understandable, the PES is voxelized, and subsequently a 3D convolutional neural network (CNN) is exploited to process this 3D energy image. As a proof of concept, the proposed pipeline is applied to predicting CO2 uptake in MOFs. The resulting model outperforms a conventional model built with geometric descriptors and requires two orders of magnitude less training data to reach a given level of performance. Moreover, the transferability of the approach to different host-guest systems is demonstrated by examining CH4 uptake in COFs. The generic character of the proposed methodology, inherited from the PES, renders it applicable to fields other than reticular chemistry.

The interaction energy between the probe molecule at grid point $\mathbf{r}_i$ and the framework atoms is computed as a sum of Lennard-Jones (12-6) pair terms within a cutoff:

$$V(\mathbf{r}_i) = \sum_{j=1}^{N} 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right], \quad r_{ij} \le r_c$$

where N is the number of framework atoms, $r_{ij}$ is the distance between the j-th framework atom and the probe molecule, $r_c$ is the cutoff radius, which was set to 10 Å, and $\varepsilon_{ij}$, $\sigma_{ij}$ combine the ε and σ values of the probe molecule and the j-th framework atom according to the Lorentz-Berthelot mixing rules:

$$\varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j}, \qquad \sigma_{ij} = \frac{\sigma_i + \sigma_j}{2}$$

For the probe molecule, $\varepsilon_i/k_B = 50$ K and $\sigma_i = 2.5$ Å. For the framework atoms, the corresponding parameters from the Universal Force Field 1 were used. Geometric overlap between a grid point, i.e. the position of the probe molecule, and a framework atom can lead to a very large (even infinite) energy value (a highly repulsive interaction), which is detrimental to the training of a neural network. For example, if a single voxel of just one material takes the value of infinity, training cannot even start, since this infinite value leads to NaNs during standardization (see "Preprocessing and training details" subsection). To avoid such problems, each voxel is filled with $e^{-\beta V(\mathbf{r}_i)}$, which tends to 0 as $V(\mathbf{r}_i) \to \infty$, where $\beta = 1/(k_B T)$, $k_B$ is the Boltzmann constant and T is the temperature, which was set to 298 K.
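A minimal Python sketch of this voxel-filling step is shown below. It assumes a toy, non-periodic framework (periodic boundary conditions and the minimum-image convention are omitted for brevity); the function names, the 25-point grid over an orthogonal box, and the small-distance guard are illustrative choices, while the probe parameters, cutoff, mixing rules and Boltzmann transform follow the text.

```python
import numpy as np

T = 298.0                          # temperature (K)
BETA = 1.0 / T                     # 1/(k_B T), with energies expressed as eps/k_B in K
R_CUT = 10.0                       # cutoff radius r_c (Angstrom)
EPS_PROBE, SIG_PROBE = 50.0, 2.5   # probe molecule: eps/k_B = 50 K, sigma = 2.5 A

def lj_energy(point, frame_xyz, frame_eps, frame_sig):
    """Lennard-Jones (12-6) energy of the probe at `point`, Lorentz-Berthelot mixing."""
    r = np.linalg.norm(frame_xyz - point, axis=1)
    r = np.maximum(r, 1e-3)                   # guard against exact overlap (r = 0)
    eps = np.sqrt(EPS_PROBE * frame_eps)      # geometric mean for epsilon
    sig = 0.5 * (SIG_PROBE + frame_sig)       # arithmetic mean for sigma
    mask = r <= R_CUT                         # pairs beyond r_c do not contribute
    sr6 = (sig[mask] / r[mask]) ** 6
    return np.sum(4.0 * eps[mask] * (sr6 ** 2 - sr6))

def voxelize_pes(frame_xyz, frame_eps, frame_sig, box_lengths, n=25):
    """Fill an n x n x n grid with exp(-beta * V); overlaps map smoothly to ~0."""
    grid = np.zeros((n, n, n))
    axes = [np.linspace(0.0, L, n) for L in box_lengths]
    for i, x in enumerate(axes[0]):
        for j, y in enumerate(axes[1]):
            for k, z in enumerate(axes[2]):
                v = lj_energy(np.array([x, y, z]), frame_xyz, frame_eps, frame_sig)
                grid[i, j, k] = np.exp(-BETA * v)   # tends to 0 as V -> infinity
    return grid
```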

Datasets
Known for their "data hungriness", deep neural networks require relatively large amounts of data to unleash their full potential. As such, to get a representative picture of the capabilities of our deep learning method, data from two large databases are used, one for each of the MOFs-CO2 and COFs-CH4 pairs. In this work, no molecular simulations were performed to generate the labels (adsorption uptakes) of the materials, since the datasets were already labeled. Information regarding the molecular simulations can be found in the original works 2,3.

MOFs
With regards to CO2, the evaluation of the proposed approach and its comparison with the conventional scheme, where geometric descriptors are used, take place on a subset of the University of Ottawa (UO) database 2. The latter contains 8 geometric descriptors and CO2 uptakes at different thermodynamic conditions for 324426 hypothetical MOFs. The geometric descriptors are: unit cell mass and volume, gravimetric surface area, void fraction, void volume, largest free sphere diameter, largest included sphere along free sphere path diameter, and largest included sphere diameter. In this work, the absolute uptake at 298 K and 0.15 bar is examined. A subset of 32432 MOFs serves as the training set, 5000 as the validation set (used only by the CNN to select the number of epochs, see "Preprocessing and training details" subsection) and 27438 as the test set.
Please note that the training set size was varied (see "Learning Curves" section) to produce the learning curves shown in the manuscript. As such, the value 32432 is the maximum training set size for the MOFs-CO2 case.

COFs
The transferability of the approach is evaluated on the COFs database created by Mercado et al. 3, containing 69839 structures. This database provides 5 geometric descriptors and CH4 uptakes at different thermodynamic conditions. The geometric descriptors are: mass density, gravimetric surface area, void fraction, pore limiting diameter and largest cavity diameter. In this work, the CH4 uptake at 298 K and 5.8 bar is examined. A subset of 55871 COFs forms the training set and the remaining 13968 constitute the test set.
Please note that the training set size was varied (see "Learning Curves" section) to produce the learning curves shown in the manuscript. As such, the value 55871 is the maximum training set size for the COFs-CH4 case.

Machine Learning Details
In all cases the geometric descriptors are coupled with the Random Forest (RF) algorithm as implemented in the scikit-learn 4 package (version 1.2.2). The PyTorch 5 framework (version 2.0.1+cu118) was used throughout this work for building and training the CNN models. The architecture of the CNN used in the present study is summarized in Table 1.
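As a hedged illustration of this conventional baseline: the excerpt does not report the RF hyperparameters, so scikit-learn defaults are assumed, and the random arrays below merely stand in for the 8 geometric descriptors and the uptake labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
# Placeholder arrays: 8 geometric descriptors per material, one uptake label each
X_train, y_train = rng.random((32432, 8)), rng.random(32432)
X_test, y_test = rng.random((27438, 8)), rng.random(27438)

rf = RandomForestRegressor(n_jobs=-1, random_state=0)   # default hyperparameters assumed
rf.fit(X_train, y_train)
print(f"R2 = {r2_score(y_test, rf.predict(X_test)):.3f}")
```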
The performance of all models in this work is measured by the coefficient of determination, R². Given a test set of size N, the latter is defined as:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \qquad (3)$$

where $y_i$, $\hat{y}_i$ are the ground truth (reference) value and predicted value of the i-th sample, respectively, and $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$, i.e. the mean value of $y_i$ in the test set.
The R² metric compares the mean squared error of the model of interest, MSE* (the numerator in Equation 3, up to a factor of 1/N), with the mean squared error of a baseline model, MSE (the denominator in Equation 3, up to the same factor), which always predicts the mean value of the test set, ȳ.
• When R² < 0, i.e. when the numerator is greater than the denominator, the model of interest performs worse than the baseline model.

• When R² = 0, i.e. when the numerator is equal to the denominator, the model of interest performs as well as the baseline model.
• When R² > 0, i.e. when the numerator is smaller than the denominator, the model of interest performs better than the baseline model.
The higher the R² value, the better the model. For a perfect model, i.e. one where $\hat{y}_i = y_i$ for each sample point, the mean squared error (the numerator) is 0 and as such, R² = 1. In all cases where 95% confidence intervals (CI) are presented, they were calculated with the percentile bootstrap method 6 using 10000 bootstrapped samples from the corresponding test set.
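A sketch of the percentile bootstrap computation, assuming the ground-truth and predicted test-set values are held in NumPy arrays; the function name is illustrative, while the 10000 resamples and the percentile limits follow the text.

```python
import numpy as np
from sklearn.metrics import r2_score

def bootstrap_r2_ci(y_true, y_pred, n_boot=10000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI for R2: resample test-set pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # one bootstrap sample of the test set
        scores[b] = r2_score(y_true[idx], y_pred[idx])
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi                             # the 2.5th and 97.5th percentiles
```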

Preprocessing
The voxels of each material are standardized "on the fly" based on the training set statistics, prior to entering the CNN (this preprocessing step is applied both during training and inference). That is, the $x_{ijk}$ voxel of a material enters the CNN as $\tilde{x}_{ijk}$:

$$\tilde{x}_{ijk} = \frac{x_{ijk} - \mu}{\sigma}$$

where µ, σ are the training mean and standard deviation, respectively, obtained as follows:

$$\mu = \frac{1}{15625\,N_{\text{train}}}\sum_{n=1}^{N_{\text{train}}}\sum_{i,j,k=1}^{25} x^{n}_{ijk}, \qquad \sigma = \sqrt{\frac{1}{15625\,N_{\text{train}}}\sum_{n=1}^{N_{\text{train}}}\sum_{i,j,k=1}^{25}\left(x^{n}_{ijk} - \mu\right)^{2}}$$

The upper limit 25 in the above sums is the grid size used in this work, 15625 is the total number of voxels in a voxelized PES (i.e. 25 × 25 × 25 = 15625), $N_{\text{train}}$ is the number of training materials, and $x^{n}_{ijk}$ is the ijk-th voxel of the n-th material in the training set. In other words, standardization is applied channel-wise (the voxelized PES is just a single-channel 3D image).
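In code, this channel-wise standardization reduces to a single (µ, σ) pair computed over every voxel of every training material; a minimal sketch, assuming the voxelized PES of the training set is stored in an array of shape (N_train, 25, 25, 25):

```python
import numpy as np

def fit_standardizer(train_grids):
    """Training statistics over all voxels of all training materials (single channel)."""
    mu = train_grids.mean()          # (1 / (15625 * N_train)) * sum over n, i, j, k
    sigma = train_grids.std()
    return mu, sigma

def standardize(grid, mu, sigma):
    """Applied on the fly to each material, during both training and inference."""
    return (grid - mu) / sigma
```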

CNN training
Regarding CNN training, the mean squared error is selected as the loss function, weights are initialized according to the He scheme 7 and the Adam 8 optimizer is employed, with learning rate η = 0.001, β₁ = 0.9, β₂ = 0.999 and ε = 1 × 10⁻⁸. The batch size is set to 64, the CNN is trained for 50 epochs with data augmentation (see "Data augmentation" subsection) and the learning rate is decayed by a factor of 0.5 every 10 epochs (learning rate scheduling), as shown in the sketch below.
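The stated setup translates into the following PyTorch sketch; `model` is a trivial stand-in for the CNN of Table 1, the synthetic tensors exist only to make the loop self-contained, and data augmentation is omitted.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def init_he(m):
    """He initialization for conv/linear weights."""
    if isinstance(m, (nn.Conv3d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity="leaky_relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Flatten(), nn.Linear(25 ** 3, 1))   # stand-in for RetNet
model.apply(init_he)

# Synthetic standardized voxel grids, just to make the loop runnable
data = TensorDataset(torch.randn(256, 1, 25, 25, 25), torch.randn(256))
train_loader = DataLoader(data, batch_size=64, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(50):                        # 50 epochs (selection explained below)
    for grids, y in train_loader:              # batches of 64
        optimizer.zero_grad()
        loss = criterion(model(grids).squeeze(-1), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                           # decay the learning rate by 0.5 every 10 epochs
```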
The number of epochs was determined by examining the performance of the CNN on the MOFs dataset. The CNN was trained with the largest training set (32432 samples) for different numbers of epochs, namely 10, 20 and 50, and the value 50 was selected, since it showed the greatest performance (as measured by R²) on the validation set (5000 samples). This value was used for all CNN models trained in this work. In other words, all CNN models are trained with the setup described above: 50 epochs, batch size 64, the Adam optimizer with η = 0.001, learning rate decay by a factor of 0.5 every 10 epochs, and data augmentation.

Table 1. RetNet architecture. An empty cell indicates that a parameter does not apply to the corresponding layer. For Conv3d layers, "In" and "Out" denote the number of input and output channels (aka feature maps), respectively. For Linear layers, "In" and "Out" denote the number of input and output features, respectively. For the Dropout layer 9, the parameter p denotes the dropout rate. In all Conv layers, the bias=False option was used.
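Since Table 1 itself is not reproduced in this excerpt, the block below is a purely illustrative stand-in: the layer counts and sizes are assumptions, and only the ingredients named in the caption are mirrored (Conv3d layers with bias=False, LeakyReLU activations, a Dropout layer with rate p, and Linear layers mapping the single-channel 25 × 25 × 25 input to one uptake value).

```python
import torch.nn as nn

class RetNetSketch(nn.Module):
    """Hypothetical RetNet-like 3D CNN; layer sizes are NOT those of Table 1."""
    def __init__(self, p_drop=0.5):                          # dropout rate p is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, bias=False), nn.LeakyReLU(),
            nn.Conv3d(16, 32, kernel_size=3, bias=False), nn.LeakyReLU(),
            nn.MaxPool3d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p_drop),
            nn.Linear(32 * 10 ** 3, 128), nn.LeakyReLU(),    # 25 -> 23 -> 21 -> 10 per axis
            nn.Linear(128, 1),                               # single regression output
        )

    def forward(self, x):                                    # x: (batch, 1, 25, 25, 25)
        return self.head(self.features(x))
```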

Figure 1. Visualizing the UO dataset. (Left) Dimensionality reduction of the MOFs dataset via PCA. Each point represents a material and is colorized by its void fraction. (Right) Violin plots for the CO2 uptake. The left and the right end of the black bar represent the first and the third quartile, respectively. The white dot denotes the median.

Figure 3. CNN performance on the test set (measured by R²) as a function of the training set size. Shaded areas correspond to the 95% CI.

Figure 4. Parity plots (top) for the ML models with geometric descriptors (RF model) and energy voxels (CNN model) regarding CO2 uptake in MOFs. Histograms (bottom) of the R² values used to construct the 95% CI. Both models were trained with the largest training set (32432 training samples). "Geometric" stands for geometric descriptors, while "Voxels" stands for energy voxels.

Figure 6. Output of the last LeakyReLU layer (i.e. the extracted features/fingerprints) of RetNet trained on MOFs with the maximum training set size. The fingerprints for the first 100 materials of the training set are depicted.

Training the CNN with the largest training set sizes took around 35 min and 45 min for MOFs-CO2 (32432 training samples) and COFs-CH4 (55871 training samples), respectively, on an NVIDIA GeForce GTX 1650 Super (4 GB VRAM).